The Binomial

For cases where we have observations that can be put into one of two groups.

\[ P(x=K|N,p) = \frac{N!}{ K! (N-K)!}p^K(1-p)^{N-K} \]

Used as the generative function for Bayesian inference.
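As a quick sanity check (not from the lecture itself), the mass function above can be computed by hand and compared to R's built-in `dbinom()`:

```r
# P(X = K | N, p) written out, using choose() for the N!/(K!(N-K)!) term
N <- 50
K <- 37
p <- 0.74

manual  <- choose(N, K) * p^K * (1 - p)^(N - K)
builtin <- dbinom(K, size = N, prob = p)

all.equal(manual, builtin)   # TRUE
```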

The Binomial Test

Can be used to test whether a set of data is equivalent to an expectation given by the underlying probability of seeing an item in “Group 1” (catfish in this example):

 

\(H_O: p = \hat{p}\)

 

For this, the test statistic is defined as the number of observations in “Group 1” (e.g., \(N_{catfish}\) in this example), and what we know about the probability of selecting items that have two distinct states is used to evaluate the likelihood of getting \(N_{catfish}\).

The Data

Borrowing from a previous example, let’s assume we are sampling fish in the James River. The hypothesis is that the frequency of catfish is

\[ p_{catfish} = \frac{N_{catfish}}{N_{catfish} + N_{other}} = \frac{37}{50} \approx 0.74 \]

So if we have another set of samples, we can test the null hypothesis:

\(H_O: p = 0.74\)

Using the new counts of \(N_{catfish}\) and \(N_{other}\).

Built-In Testing Function

R has a direct function for this, `binom.test()`, which allows testing either two-tailed or single-sided alternative hypotheses.
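For example, with the original 37-of-50 sample, the `alternative` argument (a standard `binom.test()` argument; see `?binom.test`) selects which tail is tested:

```r
# Two-sided (the default): H_A: p != 0.74
binom.test(37, 50, p = 0.74, alternative = "two.sided")

# One-sided: H_A: p < 0.74
binom.test(37, 50, p = 0.74, alternative = "less")
```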

Example With New Data

So, let’s assume you went out and sampled new data and found

catfish <- 52
non_catfish <- 47 

 

To test if this sampling exercise is consistent with the previous assumption of a 74% catfish presence, we can specify the data as:

samples <- c( catfish, non_catfish )
p_catfish <- 0.74

The Binomial Test

fit <- binom.test( samples, p = p_catfish)
fit

    Exact binomial test

data:  samples
number of successes = 52, number of trials = 99, p-value = 5.007e-06
alternative hypothesis: true probability of success is not equal to 0.74
95 percent confidence interval:
 0.4223973 0.6265570
sample estimates:
probability of success 
             0.5252525 

Conclusion?

Contained Data

As usual, the object returned by the function has all the necessary items for you to include the results in your manuscript.

names(fit)
[1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
[6] "null.value"  "alternative" "method"      "data.name"  
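So, for example, you can pull individual values out of the returned `htest` object for inline reporting:

```r
fit <- binom.test(c(52, 47), p = 0.74)

fit$estimate   # observed proportion of catfish, 52/99
fit$p.value    # exact two-sided p-value
fit$conf.int   # 95% confidence interval on p
```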

Multinomial Extension

When we have more than two unique states, we can extend the Binomial test to a more generalized Multinomial expansion. This would be used to test the hypothesis about specific probabilities for \(K-1\) states (the number of independently assorting groups). However, this is a specific version of a more generalized approach using Contingency Tables and I won’t be going over it here.

However, there are a few libraries in R available if you’d prefer to use the multinomial expansion.
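As a small taste, the multinomial mass function itself ships with base R as `dmultinom()`, and with \(K = 2\) groups it reduces exactly to the binomial (the example counts and probabilities below are made up for illustration):

```r
# Multinomial mass function: three species with hypothesized frequencies
dmultinom(c(30, 15, 5), prob = c(0.6, 0.3, 0.1))

# With K = 2 groups it collapses to the binomial mass function
all.equal(dmultinom(c(37, 13), prob = c(0.74, 0.26)),
          dbinom(37, size = 50, prob = 0.74))   # TRUE
```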

Contingency Tables

Contingency tables are another counting approach, where your data can be classified into one or more categories. Often it is taken as a comparison between populations. So, in keeping with our fish theme, let’s assume we went to another river and did another sampling run on catfish.

Nomenclature

This is defined as an \(R \times C\) Contingency Table, where \(R\) is the number of rows and \(C\) is the number of columns.

Underlying Hypothesis

For this, we are assuming that the observations in each row are taken from a larger background population.

  • \(p_{catfish,\; pop1} = \frac{O_{11}}{n_1}\)
  • \(p_{other,\; pop1} = \frac{O_{12}}{n_1}\)
  • \(p_{catfish} + p_{other} = 1.0\)

The same applies for samples from Population 2.

\(H_O: p_{catfish, \; pop1} = p_{catfish, \; pop2}\)

Larger Configurations

If we had more than just two Groups, we expand these to a vector of probabilities for each row.

\(H_O: \vec{p}_i = \vec{p}_j\)

 

And the same applies for more than just two rows.

Test Statistic

The test statistic for this is based upon the distance between the observed counts and the expected counts.

Observed

The count of items in each combination of categories: \(O_{11}\), \(O_{21}\), etc.

Expected

Based on number of categories OR external information.

Observed Values - Example Data

Let’s assume I went out and sampled 50 more fish and my overall data are now:

Observed <- matrix( c(52, 47, 32, 18), nrow=2, byrow = TRUE)
rownames( Observed ) <- c("Population 1", "Population 2")
colnames( Observed ) <- c("Catfish", "Other")
Observed
             Catfish Other
Population 1      52    47
Population 2      32    18

Defining Expectations - No Information

If we have no data, and the underlying hypothesis is that samples may be put into any of the Groups randomly (e.g., the underlying assumption that all fish species have equal abundance), then the expectation is that the frequencies of observations would be \(\frac{1}{K}\), where \(K\) is the number of groups.

Expectations - No Information

n1 <- 52 + 47
n2 <- 32 + 18 
Expected <- matrix( c( 0.5 * n1, 
                       0.5 * n1, 
                       0.5 * n2,
                       0.5 * n2), nrow=2, byrow = TRUE)
rownames( Expected ) <- c("Population 1", "Population 2")
colnames( Expected ) <- c("Catfish", "Other")
Expected
             Catfish Other
Population 1    49.5  49.5
Population 2    25.0  25.0

Expectations - Prior Information

In this example, we have a prior expectation that:

\[ p_{catfish} = 0.74 \]

\[ p_{other} = 1.0 - 0.74 = 0.26 \]

Expectations - Prior Information

Expected <- matrix( c( 0.74 * n1, 
                       (1-0.74) * n1, 
                       0.74 * n2,
                       (1-0.74) * n2), nrow=2, byrow = TRUE)
rownames( Expected ) <- c("Population 1", "Population 2")
colnames( Expected ) <- c("Catfish", "Other")
Expected
             Catfish Other
Population 1   73.26 25.74
Population 2   37.00 13.00

Observations and Expectations

The test statistic should measure how close the data in these two matrices are.

 

Observed
             Catfish Other
Population 1      52    47
Population 2      32    18
Expected
             Catfish Other
Population 1   73.26 25.74
Population 2   37.00 13.00

 

  • Small deviations \(\to\) no differences and failure to reject \(H_O\).
  • Big deviations \(\to\) rejection of \(H_O\).

Test Statistic

This is based upon an approximation to the \(\chi^2\) distribution, which is well behaved and well described for count data.

 

\[ T = \sum_{i=1}^r\sum_{j=1}^c\frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

Estimating Directly

I’m going to show you how to estimate this from the raw count data so we can work an example where you have to look up the probability from an underlying distribution. There is a chisq.test() function, but you have to be very careful that you have it set up properly for contingency table analyses.

T <- sum( (Observed - Expected)^2 / Expected  )
T
[1] 26.32813
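To illustrate the setup warning above: the goodness-of-fit form of `chisq.test()`, which takes a vector of counts plus a vector of hypothesized probabilities, tests a single population against the prior-information expectation. Passing the full \(2 \times 2\) matrix instead would run a test of homogeneity between the two populations, which is a different null hypothesis.

```r
# Population 1 counts against the hypothesized 74% / 26% split
chisq.test(x = c(52, 47), p = c(0.74, 0.26))
```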

Rejecting the Null Hypothesis

The general contingency table approach is based upon \(df = (r-1)(c-1)\) independent elements and is distributed as a \(\chi^2\) random variable. In a classic ‘hypothesis testing’ framework with \(\alpha = 0.05\), we would estimate the critical value of the test statistic as:

alpha <- 0.05
p <- 1.0 - alpha 
critical_value <- qchisq(p, df=1 )
critical_value
[1] 3.841459
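Putting the pieces together, the decision is just a comparison of the statistic to the critical value:

```r
# Observed counts and prior-information expectations from above
Observed <- matrix(c(52, 47, 32, 18), nrow = 2, byrow = TRUE)
Expected <- matrix(c(73.26, 25.74, 37.00, 13.00), nrow = 2, byrow = TRUE)

T <- sum((Observed - Expected)^2 / Expected)
critical_value <- qchisq(0.95, df = 1)

T > critical_value   # TRUE, so we reject H_O
```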

The Probability of the Test Statistic

To get the actual probability associated with \(T = 26.33\), we pass the test statistic to the cumulative distribution function for the \(\chi^2\) distribution.

p.value <- 1.0 - pchisq( T, df=1 )
p.value
[1] 2.88063e-07

Tabulating Data

  Sample Species         Site Measurement
1      1 Catfish Population 1    4.186835
2      2 Catfish Population 1   16.286218
3      3 Catfish Population 1    7.357911
4      4 Catfish Population 1   13.176415
5      5 Catfish Population 1    9.539340
6      6 Catfish Population 1   12.760953

Tabulating Data

The table() function can take the columns of factors and provide for you the observed matrix of counts. If you name your factors descriptively, it will take care of the row and column headers for you.

table( df$Site, df$Species ) -> Obs
Obs
              
               Catfish Other
  Population 1      52    47
  Population 2      32    18

What do we do when our data are not normal or are heteroscedastic?

Non-Parametric Analyses

There are equivalent tests (ok, not equivalent, but ones that can be used to answer a similar question) that do not have assumptions of normality, homoscedasticity, etc.

Important

These are largely based upon Ranks and Combinatorial Theory.

You gain broader applicability with a potential loss of information (and hence statistical power).
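To see what “based upon Ranks” looks like, here is base R’s `rank()` on a small made-up vector; note that tied values share the average of their ranks by default:

```r
x <- c(3.2, 1.1, 9.8, 1.1, 5.0)
rank(x)   # 3.0 1.5 5.0 1.5 4.0
```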

Non-Parametric Correlation

The normal cor.test() function has an option to swap out the Pearson-Product Moment approach for one based on Ranks.

Example Comparison

I’m going to use the Iris data set to look at a correlation for relatively poorly behaved data.

Example Comparison

fit_pearson <- cor.test( iris$Sepal.Length, iris$Sepal.Width )
fit_kendall <- cor.test( iris$Sepal.Length, iris$Sepal.Width, method = "kendall" )
fit_spearman <- cor.test( iris$Sepal.Length, iris$Sepal.Width, method = "spearman" )
data.frame( Model=c("Pearson","Kendall","Spearman"), 
            Parameter = c("r","tau","rho"),
            R = c(fit_pearson$estimate, fit_kendall$estimate, fit_spearman$estimate),
            P = c(fit_pearson$p.value, fit_kendall$p.value, fit_spearman$p.value ) ) -> df 

Example Comparison

Here is the difference in the models (the ties impacted the Spearman p-value estimate).

df |>
  kable( digits = 3, row.names = FALSE) |> 
  kable_classic(  full_width=FALSE )
Model Parameter R P
Pearson r -0.118 0.152
Kendall tau -0.077 0.183
Spearman rho -0.167 0.041

Non-Parametric T-Test: Wilcoxon Test

This is the non-parametric counterpart of the t-test for the equivalence of the central values of two data sets (e.g., \(H_O: \bar{x} = \bar{y}\)).

Wilcoxon Test Assumptions

This approach has the following assumptions:

  1. Both sets of data are random samples from their respective populations.
  2. The two samples are independent of each other.
  3. The measurement scale for each variable is at least ordinal (e.g., you can put them in order).

Example Data

mtcars |>
  select( Transmission = am, MPG = mpg ) |>
  mutate( Transmission = as.factor( Transmission ) ) |>
  mutate( Transmission = fct_recode( Transmission, "Automatic"="0", "Manual"="1")) -> data

data |>
  ggplot( aes(MPG, fill=Transmission) )  + 
  geom_density( alpha = 0.5 )

Wilcoxon Test

To test this, we use the Wilcoxon Rank Sum test, implemented in R as `wilcox.test()`.

Wilcoxon Test

This will take the full set of data, assign them ranked values, and then determine if the ranked values are randomly distributed between the two groups.

fit <- wilcox.test( MPG ~ Transmission, data=data)
fit 

    Wilcoxon rank sum test with continuity correction

data:  MPG by Transmission
W = 42, p-value = 0.001871
alternative hypothesis: true location shift is not equal to 0

Contained Values

As for other analyses, the object returned has a number of elements that may be helpful in writing about the analysis.

names(fit)
[1] "statistic"   "parameter"   "p.value"     "null.value"  "alternative"
[6] "method"      "data.name"  

Non-Parametric ANOVA

The Kruskal-Wallis test is an extension of the Wilcoxon test in the same way that aov() is an extension of t.test(). The approach is again based upon ranks. Here is an example data set we’ve used once before, based on air quality measured across five different months.

airquality |>
  select( Month, Ozone ) |> 
  mutate( Month = factor( Month ) ) |> 
  mutate( Month = fct_recode(Month, 
                             May="5",
                             Jun="6",
                             Jul="7",
                             Aug="8",
                             Sep="9" )) |> 
  filter( !is.na( Ozone )) -> df.air 

Kruskal-Wallis Data

df.air |>
  ggplot(aes(Month, Ozone) ) + 
  geom_boxplot(notch=TRUE)

Kruskal-Wallis Tests

To test for the equality of ozone across months, we use kruskal.test()

fit <- kruskal.test( Ozone ~ Month, data=df.air )
fit 

    Kruskal-Wallis rank sum test

data:  Ozone by Month
Kruskal-Wallis chi-squared = 29.267, df = 4, p-value = 6.901e-06

Contained Values

As for other analyses, the object returned has a number of elements that may be helpful in writing about the analysis.

names(fit)
[1] "statistic" "parameter" "p.value"   "method"    "data.name"
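The test above tells us that at least one month differs, but not which one. A common follow-up (not covered in this lecture) is base R’s `pairwise.wilcox.test()`, which runs all pairwise rank sum tests with a multiple-testing correction:

```r
# Rebuild the ozone data (Months 5-9 relabeled May-Sep)
df.air <- airquality[!is.na(airquality$Ozone), ]
df.air$Month <- factor(df.air$Month,
                       labels = c("May", "Jun", "Jul", "Aug", "Sep"))

# All pairwise comparisons with Holm-adjusted p-values
pairwise.wilcox.test(df.air$Ozone, df.air$Month, p.adjust.method = "holm")
```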

Questions?

If you have any questions, please feel free to post to the Canvas discussion board for the class, or drop me an email.
